A Self-enriching Methodology for Clustering Narrow Domain Short Texts

Authors

  • David Pinto
  • Paolo Rosso
  • Héctor Jiménez-Salazar


Similar resources

KnCr: A Short-Text Narrow-Domain Sub-Corpus of Medline

Clustering short texts in narrow domains is one of the most difficult tasks because of the high vocabulary overlap among the texts and the specific terminology used by researchers. Here, we present a new corpus of scientific texts in the medical domain, specifically about “Cancer” topics. This corpus is a subset of the last MEDLINE sample, made up of 900 abstracts of 16 diff...


Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance

Clustering short texts is a difficult task in itself, but the narrow-domain characteristic poses an additional challenge for current clustering methods. We addressed this problem with a new measure of distance between documents, which is based on the symmetric Kullback-Leibler distance. Although this measure is commonly used to calculate a distance between two probability d...


Sense Cluster Based Categorization and Clustering of Abstracts

This paper focuses on the use of sense clusters for classification and clustering of very short texts such as conference abstracts. Common keyword-based techniques are effective for very short documents only when the data pertain to different domains. In the case of conference abstracts, all the documents are from a narrow domain (i.e., share a similar terminology), which increases the difficult...


An Approach to Clustering

Free access to full-text scientific papers in major digital libraries and other web repositories is limited to their abstracts, which consist of no more than several dozen words. Current keyword-based techniques allow clustering this type of short text only when the data set is multi-category, e.g., some documents are devoted to sport, others to medicine, others to politics, etc. Howe...


Characterizing Weblog Corpora

To exploit the huge volume of information being published in the blogosphere, it is essential to provide techniques such as clustering, which can automatically analyze and classify its contents. However, these typically produce better results when dealing with wide-domain full-text documents. In most cases, however, blogs can be considered to be “short texts”, i.e., they are not ex...



Journal title:
  • Comput. J.

Volume 54

Publication date 2011